Abstract. Real-time heuristic search is a popular model of acting and learning
in intelligent autonomous agents. Learning real-time search agents improve their
performance over time by acquiring and refining a value function guiding the
application of their actions. As computing the perfect value function is typically
intractable, a heuristic approximation is acquired instead. Most studies of learning
in real-time search (and reinforcement learning) assume that a simple
value-function-greedy policy is used to select actions. This is in contrast to practice,
where high performance is usually attained by interleaving planning and acting
via a lookahead search of a non-trivial depth. In this paper, we take a step toward
bridging this gap and propose a novel algorithm that (i) learns a heuristic function
to be used specifically with a lookahead-based policy, (ii) selects the lookahead
depth adaptively in each state, and (iii) gives the user control over the trade-off
between exploration and exploitation. We extensively evaluate the algorithm in the
sliding tile puzzle testbed, comparing it to the classical LRTA* and the more recent
weighted LRTA*, bounded LRTA*, and FALCONS. Improvements of 5 to 30 times in
convergence speed are observed.

Keywords: real-time heuristic search, planning and learning, on-line learning, 
adaptive lookahead search. 

1 Problem Formulation 

Complete search methods such as A* and IDA* [1] produce optimal solutions with
admissible heuristics. The price of optimality is a substantial running time, often
exponential in the problem dimension [2]. This limits the applicability of complete search
in tasks with large state spaces and limited time per action. The body of research on
real-time search trades off optimality of solution for running time [3]. Consequently,
these techniques are widely applied to real-time path-planning, game-playing, control,
and general decision-making. Most approaches to building boundedly rational [4]
real-time decision-making agents interleave lookahead-based deliberation and backing up
the information to select an action [3, 4, 5]. Accordingly, most efforts to increase the
rationality of such agents fall into three categories: (i) better hand-engineered and
automatically derived heuristic functions [6], (ii) various lookahead control, state pruning
methods, and search extension techniques [7], and (iii) specialized hardware [7].

In this paper we will focus on the first two ways of increasing rationality of
autonomous decision-making agents. Namely, we will consider the framework of learning
in real-time heuristic search [3, 8], henceforth referred to as LRTS. It is an attractive
model of decision-making in autonomous agents since it allows the agent to improve its
performance over repeated trials in the same environment. Hence, another way to view
LRTS is in the light of on-line reinforcement learning (RL) [5]. Not only does the learning
ability enable a gradual improvement of the solution, but it also allows the agent to
act with an incomplete model of the environment, with non-stationary environments
and goals, and in non-deterministic environments. Consequently, the LRTS model has
been successfully applied to numerous practical tasks including moving target search
problems [9], robot navigation and localization [10], and robot exploration [11]. Several
important attributes of LRTS algorithms and RL agents are:

• final solution quality is measured in relation to the optimal solution quality. Having 
upper bounds on the cost of the solution the agent eventually converges to is important 
for performance guarantees; 

• convergence speed is measured in the total number of actions the agent executes 
before it converges to its final solution; 

• resource bounds are imposed on the amount of memory the agent requires to con- 
verge to its final solution. This can be crucial for autonomous agents as their hardware 
is particularly limited; 

• exploration versus exploitation control. As finding high-quality solutions requires 
an extensive exploration of the state space, there is usually a trade-off between the 
final solution quality and the speed of the convergence and the resources required. 
Thus, an important attribute of an LRTS algorithm is whether this trade-off can be 
user-controlled; 

• convergence stability stems from a related exploration and exploitation trade-off.
Namely, fast convergence to better solutions requires an aggressive exploration of the
state space. Such an LRTS agent usually demonstrates "optimism in the face of
uncertainty" [12] by optimistically abandoning an already found solution in an eager
exploration of unknown regions of the state space. As a result, the solution quality can
vary by orders of magnitude in consecutive trials. As argued in [13], this may be
unacceptable depending on the application;

• integration of learning and planning is critical in most real-world LRTS (and
learning game-playing) agents as their high performance crucially depends on extensive
lookahead that constitutes the planning step of each cycle [7, 14, 15]. Contrary to
a popular belief, deeper lookahead does not necessarily improve the decision-making
quality [16, 17, 18]. Thus, an adaptive control of lookahead (i.e., planning) [4] and its
integration with the heuristic function update (i.e., learning) appear promising.

2 Related Research 

We will now consider several LRTS algorithms in light of the attributes discussed in the
previous section. LRTA* [3] is an early and still widely used LRTS algorithm. Under
certain assumptions, it is guaranteed to converge to an optimal solution in a finite
number of trials. Being an optimality-seeking algorithm, LRTA* can require prohibitive
computational resources (memory), is unstable in convergence [13], does not provide
an exploration vs. exploitation control, and does not learn the heuristic function tailored
to its lookahead module (called 'mini-min' by Korf).

Several extensions of LRTA* have been proposed. Weighted LRTA* [13] is a
combination of LRTA* with an inadmissible initial heuristic function. The heuristic is
required to be within a factor of (1 + ε) of the optimal heuristic function (i.e., ε-admissible).
The ε parameter controls the suboptimality of the final solutions as weighted LRTA*
converges to a solution with the cost within (1 + ε) of optimal. Consequently, larger
values of ε lead to faster convergence and smaller memory requirements. The update
rule is taken directly from LRTA* and does not take the lookahead into account. As in
LRTA*, convergence stability is not directly addressed.

Bounded LRTA* [13] uses additional memory to maintain an upper bound on the
true heuristic function in each explored state. There is a user-set parameter δ used
together with the upper bound to control the amount of exploration, thereby increasing
convergence stability. On the other hand, bounded LRTA* incurs an additional memory
and running time overhead as well as additional complexity of acquiring a non-trivial
upper bound and refining it during the search. Additionally, its optimality-seeking nature
(for δ ≥ 2) can lead to intractable storage requirements. Consequently, a combination
of weighted and bounded LRTA* extensions is needed. The resulting interplay between
the ε and δ parameters and its effects on the algorithm properties is a subject of future
research [13, p. 35]. Finally, there is still no explicit consideration of lookahead in the
learning module of bounded LRTA* and it is not clear how well the weighting and
upper-bounding techniques will work with the lookahead search of a non-trivial depth.

FALCONS [19] maintains an additional heuristic function which is a lower bound
on the cost between the current state and the initial state. It was designed to speed up 
the convergence while retaining the optimality of the final solution and therefore can 
require prohibitive storage. Additionally, it does not offer the user any control over the 
exploration vs. exploitation trade-off or convergence stability. Finally, FALCONS does 
not make specific considerations of the lookahead in its learning module. 

3 Novel Learning Real-time Search Algorithm 

Intuition. While past efforts have often focused either on gaining closer approximations
to the true distance to goal [3, 6] or better management of the planning phase [7],
we recognize that these two are tightly coupled as the heuristic function is useful only
insofar as it guides the decision-making module of the agent. Consequently,
instead of learning or manually engineering heuristic functions that approximate the 
true distance to goal well, we design the learning module specifically for the adaptive 
lookahead search. Below is the intuition behind the novel LRTS algorithm: 

1. Since optimal solutions are inherently intractable in many non-trivial domains
(e.g., [1]), we allow our algorithm to converge to suboptimal solutions. This
results in significant speed-ups in convergence and savings in memory. At the same
time, similar to weighted LRTA*, we upper-bound the cost of the converged solutions.

2. Deep lookahead is wasteful in the states where the optimal action can be determined
with a shallow search. Thus, our algorithm adjusts the lookahead depth adaptively
based on the concept of so-called traps introduced later in the paper. Additionally,
we execute a (short) variable-length sequence of actions per one lookahead, thereby
reducing the running time.

3. We propose a tighter coupling between the heuristic function update and lookahead 
modules by updating the heuristic function only when the agent runs into a trap with 
respect to its lookahead module. This results in fewer heuristic function updates and 
smaller storage requirements. 

4. Similar to weighted LRTA*, we provide the user with control over the optimality of
the solutions and the resources required. The parameter γ can be selected based on
the application at hand.
Traps. A lookahead-based real-time agent such as RTA* or LRTA* conducts lookahead
of a fixed depth, expanding a potentially large number of states at each step. Upon
taking an action, it updates the heuristic function in its current state and repeats the process
until the goal state is found. Effectively, it follows the gradient of its heuristic
function until it reaches a local minimum. This can be visualized as a 'pit' in the heuristic
function surface where the heuristic values of all surrounding states together with the
distances to them exceed the heuristic value of the current state. An LRTA*-like agent
will continue moving inside the 'pit' and raising the heuristic values until the 'pit' is
completely 'filled'. Such a 'filling process' can take a large number of actions and
updates to the heuristic function. We formalize such 'pits' with a more general concept of
γ-traps and incorporate explicit and efficient trap detection and recovery methods into
our LRTS algorithm.
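
The 'filling process' can be illustrated with a minimal sketch of a lookahead-one LRTA*-style agent. The six-state corridor, the initial heuristic values, unit action costs, and the names `lrta_star_trial` and `corridor` are illustrative assumptions and are not taken from the experiments in this paper.

```python
def lrta_star_trial(neighbors, h, start, goal, max_steps=10_000):
    """Run one trial of a lookahead-one LRTA*-style agent with unit action costs.

    Returns the number of actions taken and the number of heuristic updates.
    """
    s, actions, updates = start, 0, 0
    while s != goal and actions < max_steps:
        # Score each neighbor by the action cost (1) plus its heuristic value.
        best = min(neighbors[s], key=lambda n: 1 + h[n])
        new_value = 1 + h[best]
        if new_value > h[s]:  # only upward updates to h
            h[s] = new_value
            updates += 1
        s = best
        actions += 1
    return actions, updates


# Hypothetical corridor 0 - 1 - 2 - 3 - 4 - 5 with the goal at state 5.
corridor = {i: [j for j in (i - 1, i + 1) if 0 <= j <= 5] for i in range(6)}
# Admissible initial heuristic with a 'pit' around states 0, 1, and 2.
h = {0: 1, 1: 0, 2: 0, 3: 2, 4: 1, 5: 0}
actions, updates = lrta_star_trial(corridor, h, start=1, goal=5)
```

In this toy run the agent wanders back and forth inside the pit, raising the heuristic values one update at a time, before it can leave toward the goal.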

Search problem. Before we present the algorithm, the search task needs to be formally
defined. An LRTS problem is defined as a tuple $(S, A, c, s_0, S_g)$ where $S$ is a finite set
of states; $A$ is a finite set of actions; $c : S \times A \to (0, \infty)$ is the cost function with $c(s, a)$
being the incurred cost of taking action $a$ in state $s$; $s_0$ is the initial state; and $S_g \subset S$
is the set of goal states. We adopt the assumptions of [13] that every action is reversible
in every state, every applicable action leads to another state, and at least one goal state
in $S_g$ is reachable from $s_0$.

The true cost of traveling from state $s_1$ to state $s_2$ is denoted by $\mathrm{dist}(s_1, s_2)$ and
is defined as the minimal cumulative action cost the agent incurs by traveling from
$s_1$ to $s_2$. For any state $s$, its true cost is defined as the minimal travel cost to the nearest
goal: $h^*(s) = \min_{s_g \in S_g} \mathrm{dist}(s, s_g)$. A heuristic approximation $h$ to the true cost is
called admissible iff $\forall s \in S\ [h(s) \le h^*(s)]$. The value of $h$ in state $s$ will be referred
to as the heuristic cost of state $s$. A depth $d$ child of state $s$ is any state $s'$ reachable from
$s$ in the minimum of $d$ actions (denoted by $\|s, s'\| = d$). The depth $d$ neighborhood of state
$s$ is then defined as $S_{s,d} = \{s_d \in S \mid \|s, s_d\| = d\}$. A state $s$ is called a γ-trap of depth
$d$ iff it lacks a child of depth up to $d$ such that the cost of getting to that child adjusted
by γ plus the heuristic cost of the child does not exceed the heuristic cost of the current
state: $\neg\,[\exists d' \in \{1, \dots, d\}\ \exists s' \in S_{s,d'}\ [\gamma \cdot \mathrm{dist}(s, s') + h(s') \le h(s)]]$.
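
The trap definition can be checked mechanically. The following is a minimal sketch under simplifying assumptions that are not part of the formal definition: the state space is given as an adjacency dictionary, all action costs are 1 (so the distance to a depth d child equals d), and the names `depth_neighborhoods` and `is_gamma_trap` are hypothetical.

```python
from collections import deque


def depth_neighborhoods(neighbors, s, d_max):
    """Return {d: states whose minimum number of actions from s is exactly d}."""
    depth = {s: 0}
    layers = {d: [] for d in range(1, d_max + 1)}
    queue = deque([s])
    while queue:
        u = queue.popleft()
        if depth[u] == d_max:
            continue  # do not expand beyond the maximum depth
        for v in neighbors[u]:
            if v not in depth:
                depth[v] = depth[u] + 1
                layers[depth[v]].append(v)
                queue.append(v)
    return layers


def is_gamma_trap(neighbors, h, s, d, gamma):
    """True iff no child s' of depth up to d has gamma * dist(s, s') + h(s') <= h(s).

    Assumes unit action costs, so dist(s, s') equals the depth of s'.
    """
    layers = depth_neighborhoods(neighbors, s, d)
    return not any(
        gamma * depth + h[child] <= h[s]
        for depth, children in layers.items()
        for child in children
    )


# Hypothetical corridor 0 - 1 - 2 - 3 - 4 - 5 and heuristic values.
corridor = {i: [j for j in (i - 1, i + 1) if 0 <= j <= 5] for i in range(6)}
example_h = {0: 1, 1: 0, 2: 0, 3: 2, 4: 1, 5: 0}
trapped = is_gamma_trap(corridor, example_h, s=1, d=2, gamma=1.0)
```

Breadth-first search is used here only because unit costs make depth and distance coincide; with non-uniform costs the distance term would have to be computed separately.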

The algorithm. Figure 1 presents our LRTS algorithm called γ-Trap as a control
policy outputting a series of actions a in the state s. Thus, for a given search problem
$(S, A, c, s_0, S_g)$ the agent's current state s is initialized to $s_0$ and the heuristic function
h is initialized to an admissible initial function $h_0$. The γ-Trap policy is then called
repeatedly by the environment and the series of actions a it returns are used to update the
current state until a goal state is reached. During this process, the heuristic function h is
updated by the policy as well. Once a goal state is reached, the current problem-solving
trial is deemed completed and the current state is reset to its initial location $s_0$. The
agent is then to solve the problem again starting with the heuristic function h from the
previous problem-solving trial. Convergence is achieved when a problem-solving
trial is completed without any updates to the heuristic function h. The solution produced
(i.e., the sequence of actions the agent took from $s_0$ to the goal state) is considered to
be the final solution. The cumulative cost of its actions is the final solution cost.

γ-Trap operates as follows. Lines 1 through 5 conduct the lookahead search by
incrementally expanding the neighborhood of the current state s up to the maximum 
γ-Trap(s, h, $d_{\max}$, γ)
Input:
    s: the current state
    h: heuristic function
    $d_{\max}$: maximum lookahead depth
    γ: solution quality control parameter
Output: a series of actions a

1  for d = 1 to $d_{\max}$ do
2      generate depth d neighborhood $S_{s,d} = \{s_d \in S \mid \|s, s_d\| = d\}$
3      if s is not a γ-trap of depth d, i.e., $\exists s' \in S_{s,d}\ [\gamma \cdot \mathrm{dist}(s, s') + h(s') \le h(s)]$
4          return actions a leading from s to $\arg\min_{s_d \in S_{s,d}} (\gamma \cdot \mathrm{dist}(s, s_d) + h(s_d))$
5  end for loop
6  update the heuristic function: $h(s) \leftarrow \max_{d = 1, \dots, d_{\max}} \min_{s_d \in S_{s,d}} (\gamma \cdot \mathrm{dist}(s, s_d) + h(s_d))$
7  return actions a leading from s to the previous current state

Fig. 1. LRTS algorithm γ-Trap expressed as agent control policy.

depth $d_{\max}$ specified by the user. At each iteration of the expansion process, if the
current state s is found to be a non-trap (line 3), then the lookahead search is terminated and
the series of actions a leading to the most promising state within the last expanded
neighborhood is returned (line 4). Ties are resolved randomly. Note that this operation
exits the policy call. The agent will then move to the new current state s. The heuristic
function h will not be updated. On the other hand, if the loop terminates naturally and
line 6 is reached, then the current state s is a γ-trap of depth $d_{\max}$. In this case the
heuristic function needs to be updated. In order to minimize (i) the number of updates and (ii)
the number of actions the agent takes while 'filling the pit', we do two things. First,
we increase the heuristic h(s) by the maximum allowable amount in line 6 (hence the max
over d-neighborhoods). Second, we backtrack to the previous state (line 7) and avoid
wasteful wandering inside the 'pit'. Note that if the current state s is the initial state $s_0$,
no backtracking is performed and the current state is not changed.
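
The control policy and the repeated-trial protocol can be sketched in code as follows. This is an illustrative sketch under stated assumptions and not the implementation used in the experiments: action costs are all 1, the state space is an adjacency dictionary, ties are broken deterministically (the first minimum) rather than randomly, and the "previous current state" is tracked with an explicit stack. All function names are hypothetical.

```python
from collections import deque


def bfs_layers(neighbors, s, d_max):
    """Breadth-first search from s: states grouped by depth, plus BFS parents."""
    depth, parent = {s: 0}, {s: None}
    layers = {d: [] for d in range(1, d_max + 1)}
    queue = deque([s])
    while queue:
        u = queue.popleft()
        if depth[u] == d_max:
            continue
        for v in neighbors[u]:
            if v not in depth:
                depth[v], parent[v] = depth[u] + 1, u
                layers[depth[v]].append(v)
                queue.append(v)
    return layers, parent


def gamma_trap_step(neighbors, h, s, d_max, gamma):
    """One policy call. Returns the states to move through, or None if s is a
    trap of depth d_max (h[s] has then been raised; the caller backtracks)."""
    layers, parent = bfs_layers(neighbors, s, d_max)
    layer_minima = []
    for d in range(1, d_max + 1):
        if not layers[d]:
            break
        best = min(layers[d], key=lambda c: gamma * d + h[c])
        value = gamma * d + h[best]
        if value <= h[s]:  # line 3: s is not a trap of depth d
            path = []
            while best != s:
                path.append(best)
                best = parent[best]
            return path[::-1]  # line 4: move, no heuristic update
        layer_minima.append(value)
    h[s] = max(layer_minima)  # line 6: largest of the per-depth minima
    return None


def run_trial(neighbors, h, start, goals, d_max, gamma, max_calls=100_000):
    """Run one trial; returns (number of actions, number of heuristic updates)."""
    s, stack, n_actions, n_updates = start, [], 0, 0
    for _ in range(max_calls):
        if s in goals:
            return n_actions, n_updates
        path = gamma_trap_step(neighbors, h, s, d_max, gamma)
        if path is None:
            n_updates += 1
            if stack:  # line 7: undo the last move (actions are reversible)
                s, cost = stack.pop()
                n_actions += cost
        else:
            for i, state in enumerate(path):
                if state in goals:  # stop as soon as a goal is reached
                    path = path[: i + 1]
                    break
            stack.append((s, len(path)))
            n_actions += len(path)
            s = path[-1]
    raise RuntimeError("trial did not reach a goal state")


def run_until_convergence(neighbors, h, start, goals, d_max, gamma, max_trials=1000):
    """Repeat trials until one completes with no updates to h."""
    for trial in range(1, max_trials + 1):
        n_actions, n_updates = run_trial(neighbors, h, start, goals, d_max, gamma)
        if n_updates == 0:
            return trial, n_actions  # trials used, final solution cost
    raise RuntimeError("no convergence within max_trials")


# Hypothetical corridor 0 - 1 - 2 - 3 - 4 - 5 with the goal at state 5.
corridor = {i: [j for j in (i - 1, i + 1) if 0 <= j <= 5] for i in range(6)}
h = {0: 1, 1: 0, 2: 0, 3: 2, 4: 1, 5: 0}
trials, final_cost = run_until_convergence(corridor, h, 1, {5}, d_max=2, gamma=1.0)
```

Returning `None` from the policy call and leaving the backtracking move to the trial loop is a design choice of this sketch; it keeps the record of previously visited states outside the policy itself.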

Completeness & final solution quality. Despite giving up final solution optimality,
γ-Trap is complete and converges to a reasonable final solution. More precisely, the
amount of optimality loss can be upper-bounded:

Theorem. For any search problem $(S, A, c, s_0, S_g)$ that satisfies the aforementioned
conditions and given an admissible initial heuristic $h_0$, γ-Trap will find a finite solution
on every trial. Additionally, over a finite number of repeated trials, γ-Trap will converge
to a final solution of the cost upper-bounded by $\frac{h^*(s_0)}{\gamma \min_{S \times A} c(s,a)}$.

The proof is available on our website and cannot be reproduced here due to space
constraints. Also note that the actual performance of γ-Trap is substantially better than
the upper bound suggests. For instance, in the convergence experiments detailed below
the average final solution cost for γ = 0.2 was only 144% of the optimal, which is
substantially below the upper bound of 500% suggested by the theorem.

4 Empirical Evaluation 

We evaluated the performance of γ-Trap against existing algorithms: weighted LRTA*
due to Shimbo and Ishida [13] and LRTA* and RTA* due to Korf [3], which we had
reimplemented. Additionally, since we extended the experimental setup used in [13], the
figures for FALCONS and bounded LRTA* reported there are used in the comparisons
as well.¹ Our extensions to the experiments reported in [13] were as follows: (i) we
used up to 10 folds in order to measure the standard deviation and (ii) the algorithms
were run with the lookahead depth fixed at 2, 5, 10, and 15 plies in addition to the
lookahead of one reported therein. Due to the extensive nature of the results and the
space constraints of the paper, only a subset of the findings will be reported herein. Full
results can be found in a technical report on our web site.

LRTA* and RTA* were reimplemented directly from [3]. Additionally, a version of
LRTA* upgraded with full hash-based lookahead pruning, consistent use of table-based
h function, forced monotonicity of h + g values, and only upward updates to h (as
in [13]) was implemented and run under the name of iLRTA*. It was used for weighted
LRTA* experiments and is listed as eiLRTA-x.x in the figures below where x.x is the
value of ε. Likewise, bounded LRTA* is listed as biLRTA-x.x in the graphs (x.x being
the value of δ). In order to investigate the impact of backtracking in γ-Trap, we also
implemented a version of the algorithm with no backtracking upon h-function update.
Instead, the algorithm always moves into the child node with the minimum h + γ · dist
value. In the plots, the backtracking version of γ-Trap is listed as gTrap-BT-x.x whereas
the non-backtracking version is listed as gTrap-x.x (x.x is the value of γ). Manhattan
distance was used as the initial heuristic function $h_0$ for all algorithms except weighted
LRTA* where we used Manhattan distance multiplied by (1 + ε).
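
For concreteness, the two initial heuristics can be sketched as follows. The tuple encoding of a puzzle state (row-major, with 0 for the blank), the goal layout in the example, and the function names are assumptions made for illustration only.

```python
def manhattan_distance(state, goal, width):
    """Sum of horizontal and vertical tile displacements; the blank (0) is ignored."""
    goal_pos = {tile: divmod(i, width) for i, tile in enumerate(goal)}
    total = 0
    for i, tile in enumerate(state):
        if tile == 0:
            continue
        row, col = divmod(i, width)
        goal_row, goal_col = goal_pos[tile]
        total += abs(row - goal_row) + abs(col - goal_col)
    return total


def weighted_manhattan(state, goal, width, epsilon):
    """Initial heuristic for weighted LRTA*: Manhattan distance times (1 + epsilon)."""
    return (1 + epsilon) * manhattan_distance(state, goal, width)


# Hypothetical 8-puzzle instance: tiles 7 and 8 are each one step from their goal cells.
goal = (1, 2, 3, 4, 5, 6, 7, 8, 0)
state = (1, 2, 3, 4, 5, 6, 0, 7, 8)
h0 = manhattan_distance(state, goal, width=3)
h0_weighted = weighted_manhattan(state, goal, width=3, epsilon=0.5)
```

The weighted variant can overestimate the true cost by up to a factor of (1 + ε), which is what makes the weighted LRTA* initial heuristic inadmissible.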

Convergence Speed. Within each fold, a random set of one hundred 8-puzzles was
generated. Each of the algorithms (with the exception of non-learning RTA*) was run
repeatedly on each of the puzzle instances until convergence was achieved (indicated
by the lack of h-function updates between the start and the goal states). The convergence
cost and the final solution cost relative to the optimal were averaged over 100 puzzles.
The mean and standard deviation were computed over 10 folds of 100 puzzles each.
The experiments were repeated for the lookahead depths of 1, 2, 5, 10, and 15. The
results for the lookahead of one are shown in Figure 2. As the data indicate, γ-Trap with
backtracking (gTrap-BT) outperforms all other algorithms in terms of the convergence
cost: approximately seven times better than the most successful weighted LRTA*, 13
times better than FALCONS, and 31 times better than classic LRTA*. At the same time,
it is comparable to others in terms of the final solution cost: 144% of optimal vs. 120%
in the case of the most successful weighted LRTA*. This trend was observed for all
lookahead depths.

Memory Requirements. We paralleled the experiments reported in [13] and computed
the number of 15-puzzles from the one hundred puzzle set published by Korf in [1] that
can be solved until convergence within four million heuristic function values stored.
The results can be found in Figure 3 for the lookahead of one. In the experiments,
γ-Trap was the only algorithm converging on all 100 of Korf's puzzles within four million
nodes. At the same time, it found the highest quality solutions among the algorithms
(e.g., 110% of optimal vs. 520% of optimal with the most successful weighted LRTA*).

¹ NB: some of the figures reported in [13] are incorrect due to a bug in their code. We thus use
the figures recently re-computed by Shimbo [20].
Convergence Stability. Even though γ-Trap was not designed to minimize the
oscillations in the solution cost during the convergence process, we evaluated it against
the standard algorithms along that dimension. Paralleling the study of [13], we
adopted the IAE, ISE, ITAE, ITSE, and SOD stability indices. IAE and SOD reported
in this paper are defined as: $\mathrm{IAE} = \mathrm{avg}_s \sum_i |\mathrm{cost}(s, i) - h^*(s)|$ and
$\mathrm{SOD} = \mathrm{avg}_s \sum_i \max(0, \mathrm{cost}(s, i + 1) - \mathrm{cost}(s, i))$, where $\mathrm{cost}(s, i)$ is the
state $s$'s solution cost on the $i$th trial while $h^*(s)$ is the optimal cost. The 8-puzzle
results for IAE and SOD with the lookahead of one are plotted in Figure 4. Similar
improvements were observed for deeper lookaheads. The data for bounded LRTA* and
FALCONS are taken
from [13] with the proper scaling factors of: 10"* for IAE, 10^ for ISE, 10^ for ITAE, 10^
for ITSE, and 10'^ for SOD [20]. We observe that γ-Trap with backtracking learns in
a significantly more stable fashion than all other algorithms, including bounded LRTA*,
which was specifically designed for stable convergence. In particular, for γ = 0.2 it
improves on the best previous results nearly 5-fold (in SOD) and over 14-fold (in IAE).
γ-Trap without backtracking appears comparable to weighted LRTA*.
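
The two reported indices can be computed from a per-trial record of solution costs. The sketch below is hedged: it assumes that SOD accumulates only the increases in solution cost between consecutive trials, and the function names and the sample cost sequence are hypothetical rather than taken from the experiments.

```python
from statistics import mean


def iae(trial_costs, optimal_cost):
    """Sum over trials of the absolute gap between solution cost and optimal cost."""
    return sum(abs(cost - optimal_cost) for cost in trial_costs)


def sod(trial_costs):
    """Sum over consecutive trials of the increases in solution cost (assumed one-sided)."""
    return sum(max(0, later - earlier)
               for earlier, later in zip(trial_costs, trial_costs[1:]))


def average_index(index_values):
    """Average an index over problem instances (start states)."""
    return mean(index_values)


# Hypothetical solution costs over four trials on two instances with optimal cost 4.
runs = [[8, 12, 6, 4], [6, 4, 4, 4]]
avg_iae = average_index([iae(costs, 4) for costs in runs])
avg_sod = average_index([sod(costs) for costs in runs])
```

A run whose solution cost never increases from one trial to the next has a SOD of zero under this reading, which matches the intent of measuring oscillation.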

First-trial Performance. We evaluated the first-trial performance of the LRTS
algorithms on random sets of 8-puzzles and Korf's set of 15-puzzles. Namely, within a
single fold, each algorithm was presented with 100 puzzles, each of which it had to solve
within 500 thousand moves. The experiments were repeated over 10 independent folds
each with the lookahead of 1, 2, 5, 10, and 15. The results are plotted in Figure 5. We
included the data reported by Shimbo and Ishida in [13, 20] in the bottom graph. The data
suggest that the superior convergence speed, memory requirements, and learning
stability of γ-Trap come at the cost of a more expensive first-trial solution. This behavior is
similar to that exhibited by FALCONS. Additionally, we note that the backtracking part
of γ-Trap appears to be the primary factor of its superior learning and inferior first-trial
performance as the no-backtracking version (gTrap) is similar to the weighted LRTA*
in both respects.

5 Future Work & Conclusions 

Learning real-time search is an appealing model for the study of rational autonomous 
decision-making agents. The real-time nature allows testing of various strategies of
interleaving planning and acting while the learning feature enables the agent to cope with
unknown, uncertain, and non-stationary environments. Complete search methods, such 
as IDA*, face the intractability problem apparent even in toy domains (e.g., the scalable 
sliding tile puzzle). Real-time search agents trade intractable optimal solutions for effec- 
tively computable satisfactory solutions. In this paper we discuss important attributes 
of a learning real-time search agent and the associated trade-offs. We then examine 
the existing methods including LRTA* and its extensions as well as FALCONS in the 
light of the desired properties. A novel algorithm, called γ-Trap, is then proposed, founded
on a tighter integration of adaptive lookahead-based planning and learning modules. Under
the standard assumptions, it is proved to be complete and to converge to satisfactory
solutions. We then evaluate it empirically against the existing methods and show
a significant improvement in the convergence speed and stability as well as much
reduced memory requirements. Future research directions include: deriving tighter upper
bounds on convergence speed, stability, and final solution cost, controlling the amount
of exploration on the first trial, automatic selection of the γ parameter, and applying
γ-Trap in non-deterministic, changing, and unknown environments.